1 Introduction

We seek to build a pricing structure for a car insurance product. The pure or technical premium \(\pi_i\) per policyholder \(i\) can be determined as follows: \[ \pi_i = E\left[\frac{L_i}{d_i}\right] = E\left[\frac{N_i}{d_i}\right] \cdot E\left[\frac{L_i}{N_i}| N_i > 0\right] \ ,\] where the random variable \(L_i\) is the total charge resulting from the \(N_i\) (r.v.) claims reported by policyholder \(i\) over its exposure period \(d_i\). The second equality holds under the assumption of independence between \(N_i/d_i\) (frequency component) and \(L_i/N_i\) (severity component). The goal is to use the available policyholder information to model both components of the above equation, from which a pricing grid can then be obtained.

We will start by exploring the dataset to get some insight.

2 Exploratory Data Analysis

2.1 Factor covariates

The dataset comprises 163657 observations and the following variables:

## Classes 'tbl_df', 'tbl' and 'data.frame':    163657 obs. of  17 variables:
##  $ ageph   : int  64 28 58 37 29 25 34 39 52 46 ...
##  $ codposs : int  1000 1000 1000 1030 1030 1030 1040 1040 1050 1050 ...
##  $ duree   : num  1 0.0466 0.4027 0.1699 1 ...
##  $ lnexpo  : num  0 -3.067 -0.909 -1.773 0 ...
##  $ nbrtotc : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nbrtotan: num  0 21.5 0 0 0 ...
##  $ chargtot: num  0 156 0 0 0 ...
##  $ agecar  : Factor w/ 4 levels "0-1","2-5","6-10",..: 2 3 4 2 3 4 4 4 3 3 ...
##  $ sexp    : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fuelc   : Factor w/ 2 levels "Gasoil","Petrol": 2 2 2 2 2 2 2 2 2 2 ...
##  $ split   : Factor w/ 4 levels "Once","Twice",..: 1 2 3 1 1 2 4 4 1 4 ...
##  $ usec    : Factor w/ 2 levels "Private","Professional": 1 1 1 2 1 1 1 1 1 1 ...
##  $ fleetc  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ sportc  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ coverp  : Factor w/ 3 levels "MTPL","MTPL+",..: 2 1 1 3 2 2 1 1 2 1 ...
##  $ powerc  : Factor w/ 3 levels "<66","66-110",..: 2 2 1 2 1 2 1 1 2 2 ...
##  $ agecat  : Factor w/ 5 levels "17-20","21-30",..: 4 2 3 3 2 2 3 3 3 3 ...

We first explore the dataset to get some insight on the data and see in particular how balanced the data is across the levels of the categorical variables. For exploratory purposes, a new factor variable agecat has been artificially generated for the age category of the policyholder with levels “17-20,” “21-30,” “31-60,” “61-80” and “81-95.” The minimum and maximum ages encountered in the portfolio are 17 and 95 respectively.

Although sexp will be used for exploration purposes and model fitting, note that the use of the policyholder’s gender for pricing purposes is prohibited under EU law (Gender Directive 2004/113/EC, as interpreted by the 2011 Test-Achats ruling).

We see that the proportion of males reaches 73.6% of the entire number of policyholders. Logically, policyholders in age category 31-60 form the majority (64.8%). The proportion of policyholders splitting their premium once reaches 49.6%.

We see that most cars (73.2%) have a power <66. The most common type of cover (58.3%) is MTPL. Almost all policyholders drive neither a sports car nor a car that is part of a fleet.

As can be logically expected from the previous charts, we see that most policyholders use their car for private purposes. 69.2% of policyholders drive a petrol car.

2.2 Total number of claims

We plot the histogram of nbrtotc and compute the empirical mean and variance to evaluate which distributional assumption would make sense for the maximum likelihood estimation.

At first sight, a Poisson distribution is a reasonable assumption. Let’s verify whether the data is equidispersed or not. We obtain:

This shows that the data is overdispersed, so a Negative Binomial distribution (in an MLE context) or a quasi-Poisson likelihood approach could be suitable alternatives.
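A minimal sketch of such a dispersion check; since the portfolio data is not reproduced here, a simulated negative-binomial count vector stands in for nbrtotc:

```r
# Empirical dispersion check: for a Poisson sample the variance/mean ratio is
# close to 1; a ratio well above 1 signals overdispersion.
set.seed(1)
n <- rnbinom(10000, size = 0.5, mu = 0.14)  # simulated stand-in for nbrtotc
dispersion.ratio <- var(n) / mean(n)        # > 1 signals overdispersion
dispersion.ratio
```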

2.3 Claim freq. per exposure

We plot the empirical claim frequency of the portfolio (total number of claims divided by the total exposure) obtained for each level of the factor covariates.
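The quantity behind these plots can be sketched as follows; the data frame below is a made-up miniature of the portfolio (same column names, toy values):

```r
# Empirical claim frequency per factor level: total claims / total exposure.
set.seed(2)
toy <- data.frame(split   = factor(sample(c("Once", "Twice", "Thrice", "Monthly"),
                                          300, replace = TRUE)),
                  nbrtotc = rpois(300, 0.14),       # claim counts
                  duree   = runif(300, 0.1, 1))     # exposures
freq.per.level <- tapply(toy$nbrtotc, toy$split, sum) / tapply(toy$duree, toy$split, sum)
freq.per.level
```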

We see that the policyholders under 20 have the highest level of risk (although they are clearly underrepresented in the dataset). This is also the case for the policyholders driving a car under 1 year old and those splitting their premium thrice a year or monthly.

Cars with more power (>110) as well as sports car also seem to represent a slightly higher risk.

Cars used for private purposes show a risk level equal to that of cars used for professional purposes, and cars fueled with Gasoil have a slightly higher level of risk.

2.4 Using postal codes

We now explore the postal codes. Following the approach seen during the computer lab (Katrien Antonio 2020), we compute the total exposure for each postal code. We then merge the result with the belgium dataset and compute the relative exposure per unit of area. Finally, we convert the relative exposure per unit of area into a factor variable and plot the result.

We observe that certain regions have a higher relative exposure, such as Brussels, Charleroi, Liege and Antwerp.

Doing the same for the relative severity per area unit, we obtain:

2.5 Claim severity

We now turn to the exploration of the claim severity chargtot. We first plot its empirical distribution using the records showing at least 1 claim:

Most claim amounts are small, but on rare occasions they can be very large. As the distribution is skewed to the right, this would justify the assumption of a Gamma or log-normal distribution. We further observe 3 “spikes” for claim severities below 5000. As an alternative, one could apply a “splicing” approach to create a tailored distribution accounting for these spikes. This approach would especially make sense if similar spikes were observed in other datasets as well.

To get further insight into the premium level for different risk profiles (considering each categorical variable separately), we now compute the total charges per exposure unit. We then average the amount obtained for each level of the categorical variables and plot the results.

plot.avg.sev.pe <- function(df, x, xlab) {
  g1 <- df %>% mutate(sev.pe = chargtot / duree) %>%
    group_by(.data[[x]]) %>%   # group_by_() is deprecated; use the .data pronoun
    summarize(avg.sev.pe = sum(sev.pe) / n(), n = n(), tot.sev.pe = sum(sev.pe))
  g2 <- ggplot(g1, aes(x = .data[[x]], y = avg.sev.pe)) + theme_bw() +
    geom_bar(position = 'dodge', stat = 'identity', fill = KULbg) +
    geom_text(aes(label = round(avg.sev.pe, digits = 1)),
              position = position_dodge(width = 0.9), vjust = -0.25) +
    labs(x = xlab, y = "Average severity per exposure unit")
  g2
}
j1 <- plot.avg.sev.pe(df,"sexp","Gender")
j2 <- plot.avg.sev.pe(df,"agecar","Age of the car")
j3 <- plot.avg.sev.pe(df,"split","Premium split")
j4 <- plot.avg.sev.pe(df,"powerc","Power")
j5 <- plot.avg.sev.pe(df,"coverp","Cover")
j6 <- plot.avg.sev.pe(df,"sportc","Sport car")
j7 <- plot.avg.sev.pe(df,"fleetc","Fleet")
j8 <- plot.avg.sev.pe(df,"usec","Use")
j9 <- plot.avg.sev.pe(df,"fuelc","Fuel")
j10 <- plot.avg.sev.pe(df,"agecat","Age category")

Considering each categorical variable separately, the highest average severities per exposure unit are obtained for females, policyholders aged under 20, cars aged 0-1, and policyholders splitting their premium thrice a year or monthly.

The same holds for cars with power 66-110, policyholders with MTPL and MTPL+++ cover, sports cars and non-fleet cars (again considering each categorical variable separately).

Here we see that the average severity is higher for private use and petrol fueled cars.

Finally, if we don’t take the exposure into account for the computation of the average claim severity (in the idea that it is included in the model of the claim frequency), we obtain the following results:

plot.avg.sev.pc <- function(df, x, xlab) {
  g1 <- df %>% filter(nbrtotc > 0) %>% mutate(sev.pc = chargtot / nbrtotc) %>%
    group_by(.data[[x]]) %>%   # group_by_() is deprecated; use the .data pronoun
    summarize(avg.sev.pc = sum(sev.pc) / n(), n = n(), tot.sev = sum(sev.pc))
  g2 <- ggplot(g1, aes(x = .data[[x]], y = avg.sev.pc)) + theme_bw() +
    geom_bar(position = 'dodge', stat = 'identity', fill = KULbg) +
    geom_text(aes(label = round(avg.sev.pc, digits = 1)),
              position = position_dodge(width = 0.9), vjust = -0.25) +
    labs(x = xlab, y = "Average severity per claim")
  g2
}
k1 <- plot.avg.sev.pc(df,"sexp","Gender")
k2 <- plot.avg.sev.pc(df,"agecar","Age of the car")
k3 <- plot.avg.sev.pc(df,"split","Premium split")
k4 <- plot.avg.sev.pc(df,"powerc","Power")
k5 <- plot.avg.sev.pc(df,"coverp","Cover")
k6 <- plot.avg.sev.pc(df,"sportc","Sport car")
k7 <- plot.avg.sev.pc(df,"fleetc","Fleet")
k8 <- plot.avg.sev.pc(df,"usec","Use")
k9 <- plot.avg.sev.pc(df,"fuelc","Fuel")
k10 <- plot.avg.sev.pc(df,"agecat","Age category")

We obtain similar differences across the levels of each categorical variable, except for the variables “car power” and “sport car.”

As discussed in (Henckaerts et al. 2018), claim severity is harder to model as a function of the policyholder’s characteristics, since part of the total cost of a car accident cannot be predicted from the policyholder’s own characteristics alone (e.g. the damages caused to a third party).

2.6 Exposure

To end this section, we also plot the distribution of the exposure duree.

We observe that nearly all policyholders are covered during the entire exposure period. Filtering the original dataset on the policyholders with exposure strictly lower than one, we obtain:

The latter shows 11 regularly spaced spikes, corresponding to monthly durations, as a contract will in most cases start or end at the beginning/end of a given month.

2.7 Further exploration

Other plots can be made to further explore the dataset. For example, the plot below shows the relative frequency of filing from 0 to 5 claims for the different age categories. One observes, for example, that the proportion of policyholders aged 21-30 increases with the number of claims, while the opposite holds for policyholders aged 61-80.

3 Training and test datasets

In (Henckaerts et al. 2020), the initial dataset is divided into 6 folds (after sorting on the variables of interest to account for unbalanced data) in order to then run a 5-fold cross-validation over these 6 folds. We reproduce this approach here to divide the original dataset, but we will only fit our models on one particular training/test split.

df <- df %>% arrange(nbrtotc,chargtot,duree) %>%
             mutate(fold = paste0('data', rep(seq_len(6), length = nrow(df))))
df %>% group_by(fold) %>% 
       summarise(sum(nbrtotc) / sum(duree), sum(chargtot) / sum(nbrtotc))

In this report, we will use data fold 2 as test dataset and the other data folds combined as the training dataset.

df.train <- df %>% filter(fold != 'data2')
df.test <- df %>% filter(fold == 'data2')

4 Modeling the claim frequency

4.1 Using GLM

4.1.1 General setting

As presented in the introduction, \(N_i\) is the random variable for the number of claims reported by policyholder \(i\) (measured by nbrtotc) over its exposure \(d_i\) (measured by duree). As a first try, we use a Poisson distribution to model \(N_i\). We want to model \(N_i\) while taking into account \(d_i\) and a set of risk factors \(\textbf{x}_i\) (e.g. ageph, levels of sexp, etc.). To do so, we set \[ N_i \sim Poi(d_i \cdot e^{\textbf{x}_i'\beta}) \ ,\] where \(\textbf{x}_i=(1, x_{i1},x_{i2},...)'\) and \(\beta=(\beta_0,\beta_1,...)'\). In this way, the mean of the distribution of \(N_i\) is determined by the risk profile of the policyholder and adjusted for its exposure \(d_i\). Applying the exponential function to the linear predictor \(\textbf{x}_i'\beta\) guarantees that the mean of the Poisson distribution is positive (as it must be, since the distribution is supported on the non-negative integers). By definition, we have \(E[N_i|\textbf{x}_i] = d_i e^{\textbf{x}_i'\beta}\). Consequently, the expression \(e^{\textbf{x}_i'\beta}\) can be seen as the expected annual claim frequency given the risk factors \(\textbf{x}_i\) of policyholder \(i\). We start exploring with two simple models.
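Equivalently, taking logarithms shows that the exposure enters the model as a known offset added to the linear predictor, which is why the glm calls that follow pass offset = lnexpo (recall that lnexpo \(= \log d_i\)): \[ \log E[N_i|\textbf{x}_i] = \log d_i + \textbf{x}_i'\beta \ . \]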

4.1.2 Toy model A

For example, considering the risk classes “Female” and “Male” only, let \(\textbf{x}_i=(1, x_{i1})'\), where \(x_{i1}=1\) if sexp is “Male” for policyholder \(i\) and \(0\) otherwise (sexp is “Female”), and \(\beta=(\beta_0,\beta_1)'\). Fitting this model with the glm function, the Maximum Likelihood method leads to the estimates \(\hat{\beta}_0=-1.902\) (SE=\(0.015\)) and \(\hat{\beta}_1=-0.094\) (SE=\(0.017\)). This information is taken from the output of the summary function applied to the model:

tmA <- glm(data=df.train, nbrtotc ~ sexp, offset = lnexpo, family=poisson(link="log"))
round(summary(tmA)$coefficients,digits=3)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -1.902      0.015 -130.81        0
## sexpMale      -0.094      0.017   -5.51        0

Using the estimated coefficients \(\hat{\beta}=(\hat{\beta_0},\hat{\beta_1})\), we can then compute the expected annual claim frequency by gender:

  • For sexp = Female, we have \(e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i1}}=e^{\hat{\beta}_0}=0.1493\)
  • For sexp = Male, we have \(e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i1}}=e^{\hat{\beta}_0+\hat{\beta}_1}=0.1359\)

These results correspond to the empirical claim frequencies obtained in the previous chapter, where we looked at the overall number of claims divided by the total exposure for each gender. This is expected, since the Maximum Likelihood estimator \(\beta_{MLE}\) under a Poisson likelihood is such that \[e^{\textbf{x}_{gender(i)}'\beta_{MLE}}=\frac{\sum_{j|gender(j)=gender(i)}n_j}{\sum_{j|gender(j)=gender(i)}d_j} \ , \] for each policyholder \(i\).
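This identity is easy to check numerically; here is a minimal sketch on simulated data (the toy data frame and rates below are made up, not the portfolio):

```r
# For a Poisson GLM with log link, a single factor covariate and a log-exposure
# offset, the fitted per-level frequencies equal claims / exposure per level.
set.seed(42)
toy <- data.frame(sexp  = factor(sample(c("Female", "Male"), 5000, replace = TRUE)),
                  duree = runif(5000, 0.1, 1))
toy$nbrtotc <- rpois(5000, toy$duree * ifelse(toy$sexp == "Male", 0.13, 0.15))
fit <- glm(nbrtotc ~ sexp + offset(log(duree)),
           family = poisson(link = "log"), data = toy)
fitted.freq <- exp(c(coef(fit)[1], sum(coef(fit))))   # Female, Male
emp.freq <- tapply(toy$nbrtotc, toy$sexp, sum) / tapply(toy$duree, toy$sexp, sum)
all.equal(unname(fitted.freq), unname(emp.freq), tolerance = 1e-6)
```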

We can also verify that these values are equal to the predicted value of the response variable divided by \(d_i\):

(tmA %>% broom::augment(type.predict="response") %>% 
       mutate(expected=round(.fitted/df.train$duree,digits=3)) %>% 
       select(sexp,expected) %>% distinct())

We can also obtain them as follows (where we set the exposure to 1 in the new dataset for which we want to predict the response variable):

X <- expand.grid(sexp=levels(df.train$sexp),
                 lnexpo=0)
X <- X %>% mutate(fitted.freq=predict(tmA,X,type="response")) %>% select(-lnexpo)
head(X)

4.1.3 Toy model B

As a second illustration, let’s consider the age category as risk factor and fit a GLM.

Similarly, we obtain for the predicted value of the response variable divided by \(d_i\):

(tmB %>% broom::augment(type.predict="response") %>% 
       mutate(exp=round(.fitted/df.train$duree,digits=3)) %>% 
       select(agecat,exp) %>% distinct() %>% arrange(agecat))

We can also obtain them as follows:

X <- expand.grid(agecat=levels(df.train$agecat),
                 lnexpo=0)
X <- X %>% mutate(fitted.freq=predict(tmB,X,type="response")) %>% select(-lnexpo)
X

4.1.4 Model 1 (glm1_freq)

We now add multiple risk factors and assess the model fit by analyzing the drop in deviance.

glm1_freq <- glm(data=df.train, nbrtotc ~ sexp + agecat + agecar + fuelc + split + coverp + powerc + usec + fleetc + sportc , offset = lnexpo, family=poisson(link="log"))

From the above output, we single out the covariates agecat and split, since we see a drop in deviance of

  • 912.36 when adding covariate agecat to a model with only sexp,
  • 393.11 when adding covariate split to a model with sexp, agecat, agecar and fuelc.

The covariates usec, fleetc and sportc don’t seem to bring any added value at all.

4.1.5 Model 2 (glm2_freq)

From the above results, let’s consider a model with covariates sexp, agecat, agecar, fuelc, split, coverp and powerc. Fitting a GLM, we obtain the following results:

The following table gives the coefficient estimates and their level of significance.

##                  Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)   -0.86120447 0.19321326  -4.457274 8.300839e-06
## sexpMale      -0.05980141 0.01759790  -3.398214 6.782738e-04
## agecat21-30   -0.44808713 0.18988322  -2.359804 1.828461e-02
## agecat31-60   -0.87826687 0.18941214  -4.636803 3.538387e-06
## agecat61-80   -1.12189607 0.19019405  -5.898692 3.663951e-09
## agecat81-95   -0.85276266 0.20511610  -4.157463 3.218010e-05
## agecar2-5     -0.26794501 0.03572400  -7.500420 6.361374e-14
## agecar6-10    -0.20202886 0.03642079  -5.547075 2.904881e-08
## agecar>10     -0.19361327 0.03884999  -4.983612 6.240802e-07
## fuelcPetrol   -0.17426201 0.01680732 -10.368219 3.459107e-25
## splitTwice     0.18059213 0.01843586   9.795696 1.174853e-22
## splitThrice    0.40618064 0.02662261  15.256978 1.479440e-52
## splitMonthly   0.39930710 0.02333266  17.113658 1.173895e-65
## coverpMTPL+   -0.11989566 0.01883456  -6.365727 1.943673e-10
## coverpMTPL+++ -0.15235147 0.02679513  -5.685790 1.302095e-08
## powerc66-110   0.09084301 0.01826534   4.973518 6.574878e-07
## powerc>110     0.26936824 0.07375381   3.652262 2.599403e-04

We can now compute the predicted values of the response variable for each combination of levels and sort them in ascending order.

X <- expand.grid(sexp=levels(df$sexp),
                agecat=levels(df$agecat),
                agecar=levels(df$agecar),
                fuelc=levels(df$fuelc),
                split=levels(df$split),
                coverp=levels(df$coverp),
                powerc=levels(df$powerc),
                lnexpo=0)
X <- X %>% mutate(exp.freq.glm=predict(glm2_freq,X,type="response")) %>% arrange(exp.freq.glm) %>% select(-lnexpo)

We then show the first and last 6 predicted claim frequencies.

head(X)
tail(X)

4.2 Using GBM

4.2.1 Case 1

We use the package xgboost to build our model. In order to build a cell-based tariff structure later on, we keep using the factor variable agecat. As in the GAM example from the computer lab, a better strategy would be to learn the optimal binning of ageph from the data directly rather than imposing it.

We run a 6-fold cross-validation to find the optimal hyperparameters. We voluntarily reduce the number of candidate values to limit the computation time. In the interest of time, this part of the code was not executed when generating this notebook.

hyper_grid <- expand.grid(
  eta = 0.01,
  max_depth = 3, 
  min_child_weight = 1000, 
  subsample = 0.75, 
  colsample_bytree = 0.5,
  gamma = c(0, 1),
  lambda = c(0, 0.01),
  alpha = c(0, 0.01),
  test_poisson_nloglik_mean = 0,  
  iteration = 0
)
for(i in seq_len(nrow(hyper_grid))) {
  set.seed(123)
  xval <- xgb.cv(
    data = df_xgb,
    nrounds = 1500,
    objective = "count:poisson",
    early_stopping_rounds = 20, 
    nfold = 6,
    stratified = T,
    verbose = 0,
    params = list( 
      booster = 'gbtree',
      eval_metric = 'poisson-nloglik',
      eta = hyper_grid$eta[i], 
      max_depth = hyper_grid$max_depth[i],
      min_child_weight = hyper_grid$min_child_weight[i],
      subsample = hyper_grid$subsample[i],
      colsample_bytree = hyper_grid$colsample_bytree[i],
      gamma = hyper_grid$gamma[i], 
      lambda = hyper_grid$lambda[i], 
      alpha = hyper_grid$alpha[i]
    ) 
  )
  hyper_grid$test_poisson_nloglik_mean[i] <- min(xval$evaluation_log$test_poisson_nloglik_mean)
  hyper_grid$iteration[i] <- xval$best_iteration
}
results <- hyper_grid %>%
  filter(test_poisson_nloglik_mean > 0) %>%
  arrange(test_poisson_nloglik_mean)
results

We then fit the model on the training data again, using the optimal hyperparameters.

xgb_freq <- xgboost(
  data = df_xgb,
  nrounds = 1500,
  early_stopping_rounds = 20,
  verbose = F,
  params = list(
    booster = 'gbtree',
    objective = 'count:poisson',
    eval_metric = 'poisson-nloglik',
    eta = 0.01, nthread = 1,
    subsample = .75, colsample_bynode = .5,
    max_depth = 3, min_child_weight = 1000,
    gamma = 0, lambda = .01, alpha = .01
  )
)

Among the hyperparameters, eta is the learning rate, subsample is the fraction of observations randomly sampled for each tree, colsample_bytree is the fraction of columns sampled per tree, min_child_weight is the minimum sum of instance weights (hessian) required in a child node before splitting it further, gamma is the minimum loss reduction required to make a split, and alpha and lambda are the L1 and L2 regularization parameters.

The result for the best iteration is given below.

For illustration, we plot the first tree.

For example, we can observe which covariates are used in each split and the gain obtained.

We can order the features by importance level to get some insight on the important covariates:

This ordering is obtained by computing, for each covariate, the total loss reduction (gain) of the splits using it in each tree and averaging it over all trees. We confirm that agecat and split are the variables with the most impact (as expected from the GLM results).

Now that we have fitted our model, we construct our grid showing all possible level combinations of our factor covariates and feed it to our model to obtain the predicted average claim frequencies for all risk profiles.


We then obtain the predictions and add them to our grid.

For illustration, we here show the lowest/highest 6 predictions (pred) and the associated risk profiles.

head(X)
tail(X)

From this, we see that the two models rank the risk profiles differently. The ranges of the predicted average frequencies also differ (except for the lower bound):

tibble(range.glm=range(X$exp.freq.glm),range.xgb=range(X$exp.freq.xgb))

4.2.2 Case 2

To compare with the GBM obtained with XGBoost, we now consider another GBM, fitted with the gbm function and the Poisson distribution from Harry Southworth’s gbm package (Southworth, n.d.). We follow the approach explained in (Henckaerts et al. 2020) and (Henckaerts, n.d.).

We use the gbm function to model a GBM as follows:

features <- c("sexp","agecat","agecar","fuelc","split","coverp","powerc")
gbm_freq <- gbm(
  formula = as.formula(paste('nbrtotc ~ offset(lnexpo) +', paste(features, collapse = ' + '))),
  data = df.train,
  distribution = 'poisson',
  n.trees = 1400, 
  interaction.depth = 5, 
  shrinkage = 0.01, 
  bag.fraction = 0.75, 
  n.minobsinnode = 0.01 * 0.75 * nrow(df.train), 
  verbose = FALSE
)

Here, shrinkage is the learning rate, bag.fraction is equivalent to subsample for XGBoost and n.minobsinnode is equivalent to min_child_weight. It is here initialized as a percentage of the number of observations in our training dataset (n.minobsinnode \(\simeq\) 1000).

To investigate the importance of each covariate in the GBM, we make use of the following functions from (Henckaerts, n.d.), which can also be used for regression trees and random forest models fitted using the distRforest package.

predict_model.gbm <- function(object, newdata) {
  predict(object, newdata, n.trees = object$n.trees, type = 'response')
}
var_imp <- function(object) {
  # Calculate non-normalized importance scores based on the model class
  switch(class(object)[1],
         'rpart' = data.frame(variable = names(object$variable.importance),
                              importance = object$variable.importance),
         'rforest' = object %>% distRforest::importance_rforest() %>% 
           dplyr::select(variable, importance),
         'gbm' = object %>% summary(plotit = FALSE, normalize = FALSE) %>% 
           setNames(c('variable', 'importance'))
  ) %>% 
    # Normalize the scores to sum to one
    dplyr::mutate(scale_sum = round(importance / sum(importance), digits = 4))
}
plot_var_imp <- function(data) {
  data %>% ggplot(aes(x = reorder(variable, scale_sum), y = scale_sum)) + theme_bw() +
    geom_bar(stat = 'identity', fill=KULbg) + coord_flip() + 
    labs(x = '', y = 'importance')
}

We obtain the following results, showing that agecat and split are again detected as important covariates.

To go further, one can draw the partial dependence plot (PDP) associated with each categorical variable. To do so, one fixes the variable to each of its levels in turn, computes the predictions on a given dataset with the variable held fixed, and averages them. We again use an adapted version of a function from (Henckaerts, n.d.):

For example, for the age category we obtain the graph below.

On the above graph, we see that the youngest policyholders are evaluated as riskier. The predicted frequency then decreases up to age category 61-80 and increases again for 81-95.
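The PDP computation itself can be sketched as follows; this is a minimal stand-alone version on made-up toy data (not the adapted Henckaerts function), using a simple GLM in place of the GBM:

```r
# Manual partial dependence for a factor covariate: fix the covariate to each
# level in turn, predict for every record, and average the predictions.
set.seed(7)
rates <- c("17-20" = 0.25, "21-30" = 0.18, "31-60" = 0.12)  # made-up frequencies
toy <- data.frame(agecat = factor(sample(names(rates), 500, replace = TRUE)),
                  expo   = runif(500, 0.5, 1))
toy$n <- rpois(500, toy$expo * rates[as.character(toy$agecat)])
fit <- glm(n ~ agecat + offset(log(expo)), family = poisson, data = toy)
pdp <- sapply(levels(toy$agecat), function(lv) {
  fixed <- toy
  fixed$agecat <- factor(lv, levels = levels(toy$agecat))  # hold the level fixed
  mean(predict(fit, newdata = fixed, type = "response"))   # average prediction
})
pdp
```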

As in (Henckaerts, n.d.), we now plot the so-called Individual Conditional Expectation (ICE) curves, showing the effect of each level on each individual prediction. These plots can be used to detect interaction effects, which are averaged away in the partial dependence plots.

As illustration, we here plot the ICE for the age category:

We now add the predicted average claim frequencies to our grid.

For the age category by gender, we obtain:

For the GBM, we notice an “inversion” of the curves: before age category 31-60, males are predicted to have a higher claim frequency than females, and the other way around from 31-60 onward. We also note the higher values reached by the GLM compared to the GBM, for the younger ages in particular.

For the age category by split, we obtain:

For the age category by power of the car, we obtain:

For the age category by age of the car, we obtain:

5 Modeling the claim severity

To model the claim severity, we will only consider records from the original dataset which have at least one claim.

df.train.sev <- df.train %>% filter(nbrtotc>0) %>% mutate(avg.sev=chargtot/nbrtotc)

5.1 Using GLM

After several attempts, we settled on a GLM including only coverp as covariate. Adding other covariates systematically produced non-significant coefficients. One explanation is that, in practice, the severity is mainly driven by characteristics external to the policyholder.

We obtain the following result:

summary(glm_sev)
## 
## Call:
## glm(formula = avg.sev ~ coverp, family = Gamma(link = "log"), 
##     data = df.train.sev, weights = nbrtotc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.511  -1.780  -1.023  -0.192  47.241  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.47890    0.10458  71.516   <2e-16 ***
## coverpMTPL+   -0.36387    0.18959  -1.919    0.055 .  
## coverpMTPL+++  0.06969    0.24693   0.282    0.778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 111.6813)
## 
##     Null deviance: 42263  on 15287  degrees of freedom
## Residual deviance: 41816  on 15285  degrees of freedom
## AIC: 277724
## 
## Number of Fisher Scoring iterations: 7

The estimated coefficients are as follows (none of them significant at the 5% level, except the intercept):

summary(glm_sev)$coefficients
##                  Estimate Std. Error    t value  Pr(>|t|)
## (Intercept)    7.47890420  0.1045767 71.5159854 0.0000000
## coverpMTPL+   -0.36386831  0.1895872 -1.9192666 0.0549692
## coverpMTPL+++  0.06968918  0.2469265  0.2822264 0.7777737

The predicted values of the response variable are then obtained as follows:

X.glm <- expand.grid(coverp=levels(df.train.sev$coverp),lnexpo=0)
X.glm <- X.glm %>% mutate(exp.sev.glm=predict(glm_sev,X.glm,type="response")) %>% arrange(exp.sev.glm) %>% select(-lnexpo)
X.glm

We also verify that these results are in line with the following summary statistics:

df.train.sev %>% group_by(coverp) %>% summarise(m=sum(chargtot)/sum(nbrtotc))

This is due to the fact that we used nbrtotc as the weights argument of the glm function. Without it, we would have obtained the same summary statistics as in the exploratory analysis of the severity.
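This weighting effect can be verified on simulated data (the toy frame and parameters below are made up): with weights equal to the number of claims, a Gamma GLM on the average severity per record reproduces, per level, the total charge divided by the total number of claims.

```r
# Weighted Gamma GLM (log link, one factor): fitted per-level means equal the
# claim-count-weighted empirical means, i.e. sum(chargtot) / sum(nbrtotc).
set.seed(11)
toy <- data.frame(coverp = factor(sample(c("MTPL", "MTPL+"), 2000, replace = TRUE)))
toy$ncl <- rpois(2000, 1.2) + 1                    # at least one claim
toy$avg.sev <- rgamma(2000, shape = 2,
                      rate = 2 / ifelse(toy$coverp == "MTPL+", 1200, 1700))
fit <- glm(avg.sev ~ coverp, data = toy, family = Gamma(link = "log"), weights = ncl)
fitted.sev <- exp(c(coef(fit)[1], sum(coef(fit))))  # MTPL, MTPL+
emp.sev <- tapply(toy$avg.sev * toy$ncl, toy$coverp, sum) / tapply(toy$ncl, toy$coverp, sum)
all.equal(unname(fitted.sev), unname(emp.sev), tolerance = 1e-4)
```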

5.2 Using GBM

5.2.1 Case 1

We now use the Gamma deviance to build a severity model with xgboost.


We run a 6-fold cross-validation over different combinations of hyperparameter values to determine the optimal ones. To reduce the computation time, we voluntarily restrict the set of combinations tested. In the interest of time, this part of the code was not executed when generating this notebook.

hyper_grid <- expand.grid(
  eta = 0.01,
  max_depth = 3, 
  min_child_weight = 3,
  subsample = 0.75, 
  colsample_bytree = 0.5,
  gamma = c(0, 1),
  lambda = c(0, 0.01),
  alpha = c(0, 0.01),
  test_gamma_nloglik_mean = 0,  
  iteration = 0
)
for(i in seq_len(nrow(hyper_grid))) {
  set.seed(123)
  xval <- xgb.cv(
    data = df_xgb,
    label = df.train.sev$avg.sev,
    nrounds = 1500,
    objective = "reg:gamma",
    early_stopping_rounds = 20, 
    nfold = 6,
    stratified = T,
    verbose = 0,
    params = list( 
      booster = 'gbtree',
      eval_metric = 'gamma-nloglik',
      eta = hyper_grid$eta[i], 
      max_depth = hyper_grid$max_depth[i],
      min_child_weight = hyper_grid$min_child_weight[i],
      subsample = hyper_grid$subsample[i],
      colsample_bytree = hyper_grid$colsample_bytree[i],
      gamma = hyper_grid$gamma[i], 
      lambda = hyper_grid$lambda[i], 
      alpha = hyper_grid$alpha[i]
    ) 
  )
  hyper_grid$test_gamma_nloglik_mean[i] <- min(xval$evaluation_log$test_gamma_nloglik_mean)
  hyper_grid$iteration[i] <- xval$best_iteration
}
results <- hyper_grid %>%
  filter(test_gamma_nloglik_mean > 0) %>%
  arrange(test_gamma_nloglik_mean)
results

We then fit a new model using the optimal hyperparameters.

xgb_sev <- xgboost(
  data = df_xgb,
  label = df.train.sev$avg.sev,
  weight = df.train.sev$nbrtotc,
  nrounds = 1500,
  early_stopping_rounds = 50,
  verbose = FALSE,
  params = list(
    booster = 'gbtree',
    objective = 'reg:gamma',
    eval_metric = 'gamma-nloglik',
    eta = .01, nthread = 1,
    subsample = .75, colsample_bytree = 0.5,
    max_depth = 3, min_child_weight = 1000,
    gamma = 1, lambda = 0, alpha = .01
  )
)

We can order the features by importance level:

xgb.ggplot.importance(
  importance_matrix = xgb.importance(
    feature_names = colnames(df_xgb),
    model = xgb_sev
  )
)

Although the variable agecar seems to have the most predictive impact, the contrast is not as striking as what we obtained for the claim frequency.

We then compute the predicted average claim severities for all risk profiles:

X.sev <- expand.grid(powerc=levels(df$powerc),
                    agecar=levels(df$agecar),
                    coverp=levels(df$coverp),
                    fuelc=levels(df$fuelc))
X.sev_xgb <- xgb.DMatrix(data = X.sev %>% data.matrix)
X.sev$avg.sev.xgb <- xgb_sev %>% predict(newdata = X.sev_xgb, n.trees = xgb_sev$niter,type="response")
X.sev <- X.sev %>% arrange(avg.sev.xgb)

For illustration, we show the lowest/highest average claim severities.

head(X.sev)
tail(X.sev)

5.2.2 Case 2

Inspired by (Henckaerts et al. 2020) and (Henckaerts, n.d.), we model the average severity using a Gamma GBM, where we use Harry Southworth’s implementation (Southworth, n.d.) for the Gamma loss function.

We use the gbm function to model a GBM as follows:

features <- c("powerc","agecar","coverp","fuelc")
#devtools::install_github('harrysouthworth/gbm')
gbm_sev <- gbm(
  formula = as.formula(paste('avg.sev ~', paste(features, collapse = ' + '))),
  data = df.train.sev,
  weights = nbrtotc,
  distribution = 'gamma',
  n.trees = 1500, 
  interaction.depth = 1, 
  shrinkage = 0.01, 
  bag.fraction = 0.75, 
  n.minobsinnode = 0.01 * 0.75 * nrow(df.train.sev), 
  verbose = FALSE
)

X.sev$avg.sev.gbm <- gbm_sev %>% predict(X.sev %>% select(-avg.sev.xgb),n.trees = gbm_sev$n.trees, type='response')

For illustration, we show the lowest/highest predicted average claim severities, which seem to be of the same order as those obtained with xgboost.

head(X.sev)
tail(X.sev)

6 Model performance

Inspired by (Henckaerts, n.d.), we apply our models to the test dataset to assess their statistical performance.

The Poisson and Gamma deviances are used as alternatives to the Normal deviance, to match the empirical properties of the frequency and severity (discussed in the exploratory data analysis chapter). We implement them as in (Henckaerts, n.d.):

# Poisson deviance
dev_poiss <- function(ytrue, yhat) {
  -2 * mean(dpois(ytrue, yhat, log = TRUE) - dpois(ytrue, ytrue, log = TRUE), na.rm = TRUE)
}
# Gamma deviance
dev_gamma <- function(ytrue, yhat, wcase) {
  -2 * mean(wcase * (log(ytrue/yhat) - (ytrue - yhat)/yhat), na.rm = TRUE)
}
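
As a quick sanity check on toy values (hypothetical numbers, not from the dataset, using the two functions just defined): both deviances vanish for a perfect prediction and increase as predictions drift away from the observations:

```r
# Perfect predictions give zero deviance...
ytrue <- c(0, 1, 2, 4)
dev_poiss(ytrue, ytrue)                      # 0
dev_gamma(ytrue[-1], ytrue[-1], rep(1, 3))   # 0 (the Gamma deviance needs ytrue > 0)
# ...and any departure from the truth makes it strictly positive.
dev_poiss(ytrue, ytrue + 0.5) > 0            # TRUE
```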

We compute the Poisson deviance for the frequency models:

oos_pred %>% dplyr::select(ends_with('_freq_pred')) %>% 
  purrr::map(~ dev_poiss(df.test$nbrtotc, .x * df.test$duree))
## $glm2_freq_pred
## [1] 0.5506731
## 
## $xgb_freq_pred
## [1] 0.5504147
## 
## $gbm_freq_pred
## [1] 0.5356918

From the above results, the lowest Poisson deviance is obtained with the GBM model fitted with the gbm function.

We then compute the Gamma deviance for the severity models:

oos_pred %>% dplyr::select(ends_with('_sev_pred')) %>% 
  purrr::map(~ dev_gamma(df.test$chargtot/df.test$nbrtotc, .x, df.test$nbrtotc))
## $glm_sev_pred
## [1] 2.626354
## 
## $xgb_sev_pred
## [1] 2.641206
## 
## $gbm_sev_pred
## [1] 2.645488

From the above results, we notice that for this data fold, the GBM model fitted with the gbm function does not achieve the lowest Gamma deviance. As seen in the lecture, the ranking between models is much less clear-cut for severity than for frequency, which may reflect the limitation of predicting claim severity from policyholder characteristics alone.

7 Pure premiums

As seen in the introduction, the pure or technical premium \(\pi_i\) of policyholder \(i\) is determined by the following expression:

\[ \pi_i = E\left[\frac{L_i}{d_i}\right] = E\left[\frac{N_i}{d_i}\right] \cdot E\left[\frac{L_i}{N_i}| N_i > 0\right] \ . \]
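
As a toy illustration of this decomposition (hypothetical numbers, not model output): a profile with an expected claim frequency of 0.12 claims per exposure year and an expected severity of 1,500 per claim gets a pure premium of 180:

```r
freq <- 0.12    # E[N/d]: expected claims per unit of exposure (hypothetical)
sev  <- 1500    # E[L/N | N > 0]: expected cost per claim (hypothetical)
freq * sev      # pure premium: 180
```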

Using the model glm2_freq for the claim frequency and glm_sev for the average claim severity, we compute the premiums denoted riskP.glm; we note, however, that glm_sev was not found significant. Using both XGBoost models (xgb_freq for the frequency, xgb_sev for the severity), we compute the risk premiums denoted riskP.xgb. Finally, using both models fitted with the gbm function (gbm_freq and gbm_sev), we compute the risk premiums denoted riskP.gbm.

premiums <- X %>% left_join(X.glm,by=c("coverp")) %>%
            mutate(riskP.glm=exp.freq.glm*exp.sev.glm) %>%
            left_join(X.sev,by=c("powerc","agecar","fuelc","coverp")) %>%
            mutate(riskP.xgb=exp.freq.xgb*avg.sev.xgb) %>%
            mutate(riskP.gbm=exp.freq.gbm*avg.sev.gbm) %>% 
            select(-exp.freq.glm,-exp.freq.xgb,-exp.freq.gbm,-exp.sev.glm,-avg.sev.xgb,-avg.sev.gbm)

As an illustration, we show below the first and last six risk profiles with their associated risk premiums.

8 Risk loadings

As explained in (Yang, Li, and Meng 2020), different strategies exist to determine the risk loading. As a first strategy, we can apply the expected value premium principle. To do so, we first determine the total risk premium \(C\) of the whole portfolio. Then we determine the parameter \(\phi\) such that \[ \sum_{i=1}^N \left( \pi_i + \phi \pi_i \right) = C \ , \]

where \(\phi \pi_i\) represents the risk loading for policyholder \(i\).

The total risk premium \(C\) of the whole portfolio is computed by bootstrapping the total amount of the portfolio.

boot_total <- function(original_vector, resample_vector) {
  sum(original_vector[resample_vector])
}
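
To illustrate the statistic’s signature on toy data (hypothetical values and seed, reusing boot_total from above): boot resamples the indices and passes them as the second argument, so the statistic returns the total of each resampled portfolio:

```r
library(boot)
set.seed(123)                       # hypothetical seed, for reproducibility
toy_charges <- c(0, 0, 150, 0, 800) # stand-in for df$chargtot
toy_boot <- boot(toy_charges, boot_total, R = 100)
summary(toy_boot$t)                 # distribution of resampled portfolio totals
```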

total_results <- boot(df$chargtot, boot_total, R = 2000)

We plot below the resulting distribution for 2000 replications.

plot(total_results)

Then we define \(C\) as the 0.995 quantile of the resulting distribution (i.e. the one-year VaR at the 99.5% level):

(C <- quantile(total_results$t,0.995))
##    99.5% 
## 39748538

Solving the above equation for \(\phi\), we obtain: \[ \phi = \frac{C - \sum_{i=1}^N \pi_i}{\sum_{i=1}^N \pi_i} \ .\]
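
A quick numerical check of this formula on toy premiums (hypothetical values): with pure premiums summing to 600 and a target total of 660, \(\phi = 0.1\) and the loaded premiums recover the target exactly:

```r
pi_i  <- c(100, 200, 300)            # hypothetical pure premiums
C_toy <- 660                         # hypothetical portfolio target
phi   <- (C_toy - sum(pi_i)) / sum(pi_i)
phi                                  # 0.1
sum(pi_i * (1 + phi))                # 660, equal to C_toy
```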

We apply this formula to the three cases and add the risk loadings riskL.glm/xgb/gbm to the premiums table.

# We merge the portfolio with the pure premium table.
df.claims <- df %>% left_join(premiums,by=c("sexp","agecat","agecar","fuelc","split","coverp","powerc"))
# We compute the phi parameter.
phi.glm <- (C-sum(df.claims$riskP.glm))/sum(df.claims$riskP.glm)
phi.xgb <- (C-sum(df.claims$riskP.xgb))/sum(df.claims$riskP.xgb)
phi.gbm <- (C-sum(df.claims$riskP.gbm))/sum(df.claims$riskP.gbm)
# We compute the risk loadings.
premiums <- premiums %>% mutate(riskL.glm=phi.glm*riskP.glm,
                                riskL.xgb=phi.xgb*riskP.xgb,
                                riskL.gbm=phi.gbm*riskP.gbm)

As an illustration, we show the first and last six records below:

head(premiums)
tail(premiums)

Considering instead the expected shortfall, i.e. the mean of the bootstrap outcomes beyond the VaR, we obtain for \(C\):

(C.es <- mean(total_results$t[total_results$t > C]))
## [1] 40482898

The parameter \(\phi\) then becomes:

phi.glm.es <- (C.es-sum(df.claims$riskP.glm))/sum(df.claims$riskP.glm)
phi.xgb.es <- (C.es-sum(df.claims$riskP.xgb))/sum(df.claims$riskP.xgb)
phi.gbm.es <- (C.es-sum(df.claims$riskP.gbm))/sum(df.claims$riskP.gbm)

The new risk loadings are then computed and added to the premium table:

premiums <- premiums %>% mutate(riskL.glm.es=phi.glm.es*riskP.glm,
                                riskL.xgb.es=phi.xgb.es*riskP.xgb,
                                riskL.gbm.es=phi.gbm.es*riskP.gbm)

As an illustration, we show the first and last six records below:

head(premiums)
tail(premiums)

9 Total risk premiums

Finally, we compute the total premiums under both the value-at-risk and the expected shortfall approaches.

premiums <- premiums %>% mutate(riskT.glm.VaR=riskP.glm+riskL.glm,
                                riskT.xgb.VaR=riskP.xgb+riskL.xgb,
                                riskT.gbm.VaR=riskP.gbm+riskL.gbm,
                                riskT.glm.ES= riskP.glm+riskL.glm.es,
                                riskT.xgb.ES= riskP.xgb+riskL.xgb.es,
                                riskT.gbm.ES= riskP.gbm+riskL.gbm.es
                                )

9.1 Table of risk premiums

The complete table is shown below. For clarity, only the total risk premiums are displayed in the first table, the associated risk profiles (RP) being listed in a separate table below.

premiums %>% mutate(RP=1:n()) %>% select("RP",starts_with("riskT"))

9.2 Table of risk profiles

premiums %>% mutate(RP=1:n()) %>% select("RP","sexp","agecat","agecar","fuelc","split","coverp","powerc")

We recall that gender (and any proxy variable) should in practice be avoided in any pricing structure.

References

Henckaerts, Roel. n.d. “treeML.” GitHub. https://github.com/henckr/treeML/blob/master/treeML.nb.html.
Henckaerts, Roel, Katrien Antonio, Maxime Clijsters, and Roel Verbelen. 2018. “A Data Driven Binning Strategy for the Construction of Insurance Tariff Classes.” Scandinavian Actuarial Journal 2018 (8): 681–705. https://doi.org/10.1080/03461238.2018.1429300.
Henckaerts, Roel, Marie-Pier Côté, Katrien Antonio, and Roel Verbelen. 2020. “Boosting Insights in Insurance Tariff Plans with Tree-Based Machine Learning Methods.” http://arxiv.org/abs/1904.10890.
Antonio, Katrien, Roel Verbelen, and Roel Henckaerts. 2020. “Computer Lab 2.”
Southworth, Harry. n.d. “gbm.” GitHub. https://github.com/harrysouthworth/gbm.
Yang, Liang, Zhengxiao Li, and Shengwang Meng. 2020. “Risk Loadings in Classification Ratemaking.” http://arxiv.org/abs/2002.01798.